Pivot-Based Topic Models for Low-Resource Lexicon Extraction
نویسندگان
چکیده
This paper proposes a range of solutions to the challenges of extracting large and highquality bilingual lexicons for low-resource language pairs. In such scenarios there is often no parallel or even comparable data available. We design three effective pivotbased approaches inspired by the state-ofthe-art technique of bilingual topic modelling, extending previous work to take advantage of trilingual data. The proposed models are shown to outperform traditional methods significantly and can be adapted based upon the nature of available training data. We demonstrate the accuracy of these pivot-based approaches in a realistic scenario generating an IcelandicKorean lexicon from Wikipedia.
منابع مشابه
Pivot, Box and Trilingual: Lexicon Extraction for Low-Resource Language Pairs with Extended Topic Models
Data-driven approaches to natural language processing have been shown to be greatly effective, and the case of bilingual lexicon extraction is no exception. While training data is readily available for many language pairs, many existing approaches fail for languages for which there simply does not exist parallel data. While there have been many studies on bilingual lexicon extraction, there has...
متن کاملEvaluating a Pivot-Based Approach for Bilingual Lexicon Extraction
A pivot-based approach for bilingual lexicon extraction is based on the similarity of context vectors represented by words in a pivot language like English. In this paper, in order to show validity and usability of the pivot-based approach, we evaluate the approach in company with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word as...
متن کاملConstraint-Based Bilingual Lexicon Induction for Closely Related Languages
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction becomes a difficult task for low-resource languages. Pivot language and cognate recognition approach have been proven useful to induce bilingual lexicons for such languages. We analyze the features of closely related languages and define a semantic constraint assumption. Based on the assumption, we propose...
متن کاملBilingual Multi-Word Lexicon Construction via a Pivot Language
Bilingual multi-word lexicons are helpful for statistical machine translation systems to improve their performance. In this paper we present a method for constructing such lexicons in a resource-poor language pair such as Korean-French. By using two parallel corpora sharing one pivot language we can easily construct such lexicons without any external language resource like a seed dictionary. Th...
متن کاملAugmenting Phrase Table by Employing Lexicons for Pivot-based SMT
Pivot language is employed as a way to solve the data sparseness problem in machine translation, especially when the data for a particular language pair does not exist. The combination of source-to-pivot and pivot-to-target translation models can induce a new translation model through the pivot language. However, the errors in two models may compound as noise, and still, the combined model may ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015